Problems with Chinchilla Approach 2

Systematic biases in scaling law inference from IsoFLOP parabola fits

Artistic rendering of IsoFLOP curves

Motivation

Chinchilla Approach 2 is arguably the most widely adopted method for fitting scaling laws in practice today. Introduced in the original Chinchilla paper[1], it has since been used by leading AI labs including DeepMind[1],[7] (its creators), Meta[2],[9], DeepSeek[3], Microsoft[4], Amazon[6], Waymo[8], and Arc Institute[5], among others. It is also a workhorse method for academic scaling law studies[10],[11],[12] and high-profile practitioner tutorials from researchers like Andrej Karpathy.

The method's appeal lies in its stability and data efficiency relative to nonlinear optimization over all loss surface parameters. Rather than fitting all five parameters simultaneously, Approach 2 targets only the two scaling exponents, relying on second-order Taylor approximations that reduce each IsoFLOP curve to a simple parabola. This sacrifices recovery of the full loss surface, but it lets practitioners extract the quantities most actionable for compute allocation planning through a sequence of straightforward polynomial and linear fits, without ever touching a nonlinear optimizer.

To our knowledge, the sensitivity of these approximations and the method's behavior on loss surfaces that are less symmetric than the original Chinchilla form (where parameter and token scaling exponents are roughly equal) have not been studied in detail. This article investigates that gap through noise-free synthetic simulations that isolate systematic biases inherent to the method itself by eliminating all sources of statistical noise.

We show how these biases affect downstream decisions like dataset size selection for final training runs at large compute budgets. We show how extrapolation errors trace back to suboptimal IsoFLOP experiment design, and that pathologies in these designs can be observed in real, high-profile scaling law studies even if they are difficult to quantify precisely. Finally, we propose an alternative fitting method that is simple, stable, and free of these biases while building on the same intuitive computational shortcut: optimizing exponential terms separately from linear terms.

Preliminaries: Loss Surface, Notation, and Fitting Methods

Neural scaling laws describe how model performance improves with compute. The Chinchilla loss surface models this relationship as:

\[ L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta} \]

where \(N\) is the number of parameters, \(D\) is the number of training tokens, \(E\) is the irreducible loss, and \(A, B, \alpha, \beta\) capture how quickly performance improves with scale.

Given a compute budget \(C \approx 6ND\), the optimal allocation satisfies:

\[ N^* \propto C^a \quad \text{where} \quad a = \frac{\beta}{\alpha + \beta} \] \[ D^* \propto C^b \quad \text{where} \quad b = \frac{\alpha}{\alpha + \beta} \]
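As a concrete illustration of these formulas, the sketch below computes the exponents and the compute-optimal allocation in closed form. The constant `G` is not written out in this article; it follows from minimizing the loss subject to \(C = 6ND\) and is included here as a convenience. The function name is hypothetical.

```python
def optimal_allocation(C, A, B, alpha, beta):
    """Closed-form minimizer of E + A/N^alpha + B/D^beta subject to C = 6*N*D.

    E does not affect the location of the optimum, so it is not an argument.
    """
    a = beta / (alpha + beta)    # N* scales as C^a
    b = alpha / (alpha + beta)   # D* scales as C^b
    # G comes from the first-order condition alpha*A*N^-alpha = beta*B*D^-beta
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = G * (C / 6.0) ** a
    d_opt = (C / 6.0) ** b / G
    return a, b, n_opt, d_opt


# Example: a surface with equal exponents splits compute evenly (a = b = 0.5)
a, b, n_opt, d_opt = optimal_allocation(1e21, A=400.0, B=400.0, alpha=0.31, beta=0.31)
print(f"a = {a:.3f}, b = {b:.3f}, N* = {n_opt:.3e}, D* = {d_opt:.3e}")
```

Because \(a + b = 1\), the product \(6 N^* D^*\) reproduces the budget \(C\) exactly, which is a quick sanity check on any implementation.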

Recovering the exponents \(a\) and \(b\) from empirical training runs is crucial for planning efficient large-scale training. Two canonical approaches exist:

Approach 2: IsoFLOP Parabolic Fitting

This method is presented in the Chinchilla paper. The key insight is that along a fixed-compute contour (IsoFLOP curve), loss as a function of \(\log N\) is approximately parabolic near the optimum.

  1. Sample IsoFLOP contours: For each compute budget \(C\), train models at various \((N, D)\) pairs satisfying \(C = 6ND\)
  2. Fit parabolas: For each budget, fit \(L = p(\log N)^2 + q(\log N) + r\) and extract the minimum \(N^*\)
  3. Fit power laws: Regress \(\log N^*\) against \(\log C\) to recover the exponent \(a\) (and similarly for \(D^*\), \(b\))

The appeal is simplicity: only polynomial fits, no nonlinear optimization. The parabolic approximation comes from a Taylor expansion of the loss surface around the optimum.
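The three steps above can be sketched with the standard library alone. This is a minimal illustration rather than the original paper's code: the surface values are the symmetric example used later in this article, the grid is centered on the true optimum (which a real experiment would not know), and the grid half-width and the 15 points per curve are assumptions. The helper names are hypothetical.

```python
import math

# Symmetric example surface: L = 1.69 + 400/N^0.31 + 400/D^0.31
E, A, B, ALPHA, BETA = 1.69, 400.0, 400.0, 0.31, 0.31


def loss(N, D):
    return E + A / N ** ALPHA + B / D ** BETA


def true_log10_n_opt(C):
    """log10 of the true compute-optimal N for budget C = 6*N*D."""
    G = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    return math.log10(G) + BETA / (ALPHA + BETA) * math.log10(C / 6.0)


def parabola_vertex(xs, ys):
    """Vertex of the least-squares parabola through (xs, ys).

    Assumes xs are evenly spaced, so odd moments about their mean vanish
    and the normal equations reduce to the closed form below.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    u = [x - x_mean for x in xs]
    s2 = sum(v ** 2 for v in u)
    s4 = sum(v ** 4 for v in u)
    q = sum(v * y for v, y in zip(u, ys)) / s2
    p = (n * sum(v * v * y for v, y in zip(u, ys)) - s2 * sum(ys)) / (n * s4 - s2 ** 2)
    return x_mean - q / (2.0 * p)


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return slope, y_mean - slope * x_mean


def approach2(budgets, half_width=math.log10(16.0), n_points=15):
    log_c, log_n_star = [], []
    for C in budgets:
        center = true_log10_n_opt(C)  # step 1: sample an IsoFLOP contour
        log_ns = [center - half_width + 2.0 * half_width * i / (n_points - 1)
                  for i in range(n_points)]
        losses = [loss(10 ** ln, C / (6.0 * 10 ** ln)) for ln in log_ns]
        log_c.append(math.log10(C))
        log_n_star.append(parabola_vertex(log_ns, losses))  # step 2: parabola minimum
    return linear_fit(log_c, log_n_star)  # step 3: power-law fit in log-log space


a_hat, a0_hat = approach2([1e17, 1e18, 1e19, 1e20, 1e21])
print(f"a = {a_hat:.6f}, log10 intercept = {a0_hat:.6f}")
```

The same loop applies to \(D^*\) by fitting the parabola in \(\log D\) instead of \(\log N\).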

Approach 3: Direct Surface Fitting

The alternative is to fit all five parameters \((E, A, B, \alpha, \beta)\) simultaneously via nonlinear least squares. This avoids the parabolic approximation entirely but is notoriously unstable: highly sensitive to initialization and prone to converging to spurious local minima.

The Happy Path: Symmetric Surfaces

Before examining failure modes, let's establish that Approach 2 works perfectly under ideal conditions. Consider a symmetric loss surface where \(\alpha = \beta\):

\[ L(N, D) = 1.69 + \frac{400}{N^{0.31}} + \frac{400}{D^{0.31}} \]

With equal exponents, the optimal allocation splits compute evenly between parameters and data. The true scaling exponents are:

\[ a = b = \frac{0.31}{0.31 + 0.31} = 0.5 \]

We sample five IsoFLOP contours spanning \(10^{17}\) to \(10^{21}\) FLOPs, fit parabolas to each, and extract the optimal token count \(D^*\).

Approach 2 on symmetric surface showing perfect recovery
Figure 1: Approach 2 applied to a symmetric loss surface. Left: IsoFLOP curves with fitted parabolas. True (×) and inferred (+) optima are indistinguishable. Right: Power-law fit recovers the exact scaling exponent.

The results confirm perfect recovery of the token scaling exponent and intercept:

| Parameter | True Value | Inferred Value | Relative Error |
|---|---|---|---|
| b (D* exponent) | 0.500000 | 0.500000 | +6.2×10⁻¹²% |
| b₀ (D* intercept) | −0.389076 | −0.389076 | −1.4×10⁻¹⁰% |

✓ Key Result

On a symmetric loss surface with perfectly crafted IsoFLOP grid sampling, Approach 2 recovers both exponents and intercepts with machine-precision accuracy. When \(\alpha = \beta\), the parabola vertex shift is zero, so the inferred optima coincide with the true optima.

This establishes our baseline: Approach 2 is precisely correct under ideal conditions that are unrealistic in practice. The problems arise when we deviate from these ideal conditions, as we'll see in the following sections where these conditions are perturbed in controlled ways.

Asymmetric Surfaces: Intercept and Extrapolation Errors

We repeat the exact same procedure as before: perfect sampling centers, no noise, identical methodology. The only change is that the loss surface is now asymmetric (\(\alpha \neq \beta\)).

What Happens

Simulation results show that when the loss surface is asymmetric, Approach 2 produces systematically wrong intercepts while exponents remain accurate. This isn't statistical noise; it's a deterministic bias from fitting parabolas to a non-parabolic surface.

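We test two configurations to see how the effect scales: the Chinchilla surface (\(\alpha = 0.34\), \(\beta = 0.28\), so \(\alpha/\beta \approx 1.2\)) and a more strongly imbalanced Asymmetric surface (\(\alpha = 0.465\), \(\beta = 0.155\), so \(\alpha/\beta = 3\)).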

The Asymmetric surface is not a contrived stress test. An exponent ratio of 3.0 is comparable to what has been observed in practice: DeepSeek[3] reports compute-optimal allocation exponents of \(a = 0.73\), \(b = 0.27\) for an OpenWebText2 variant, implying a loss surface exponent ratio of \(\beta / \alpha \approx 2.7\). The asymmetry runs in the opposite direction from our Asymmetric surface (\(\beta > \alpha\) rather than \(\alpha > \beta\)), but the degree of imbalance is similar, and it is the magnitude of the imbalance, not its direction, that drives the biases studied here.

Approach 2 on asymmetric surfaces showing intercept errors
Figure 2: Approach 2 on asymmetric loss surfaces. Note the visible gap between true (dashed) and inferred (solid) power-law lines in the Asymmetric case. The exponents match perfectly, but the intercepts differ.

Chinchilla Surface

| Parameter | True Value | Inferred Value | Relative Error |
|---|---|---|---|
| b (D* exponent) | 0.548387 | 0.548387 | ≈ 0% |
| b₀ (D* intercept) | −0.555357 | −0.578092 | −4.1% |

Asymmetric Surface

| Parameter | True Value | Inferred Value | Relative Error |
|---|---|---|---|
| b (D* exponent) | 0.750000 | 0.750000 | ≈ 0% |
| b₀ (D* intercept) | −1.345791 | −1.459957 | −8.5% |

Why This Is Surprising

A few percent error in the intercept might seem minor, but consider that this simulation gave Approach 2 every advantage. The data is perfect: no measurement noise, with every point lying exactly on the true loss surface. The sampling is perfect too, with IsoFLOP grids centered precisely at the true optimum (something you wouldn't know how to do in practice). And the parameters are standard, taken directly from the Chinchilla paper rather than contrived to expose a potentially unrealistic weakness.

✓ Key Result

Even under these ideal conditions, Approach 2 produces biased intercepts for asymmetric surfaces. The error is systematic, a property of the parabolic approximation, not statistical noise.

Why It Happens

The IsoFLOP loss curve is not a true parabola; it contains exponential terms. When a parabola is fit to this curve, the parabola's minimum (vertex) doesn't land exactly at the true optimum. It shifts slightly, and the key insight is that this shift depends only on the loss surface shape (\(\alpha\), \(\beta\)) and the sampling grid. It does not depend on compute budget. The sampling grid size becomes important here: wider grids amplify the mismatch between the true curve and its parabolic approximation, increasing the vertex shift.
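The budget independence can be checked directly from the loss surface. The following is a sketch in this article's notation, with \(w = \log_{10}(N/N^*)\) and the shorthand \(K(C)\) introduced here for convenience. Along an IsoFLOP curve \(D = C/(6N)\), so \(D = D^* \cdot 10^{-w}\) and

\[ L(w) = E + A\,(N^*)^{-\alpha}\, 10^{-\alpha w} + B\,(D^*)^{-\beta}\, 10^{\beta w} \]

Setting \(dL/dw = 0\) at \(w = 0\) gives the first-order condition \(\alpha A\,(N^*)^{-\alpha} = \beta B\,(D^*)^{-\beta} \equiv K(C)\), which lets both terms be written with a common factor:

\[ L(w) = E + K(C) \left[ \frac{10^{-\alpha w}}{\alpha} + \frac{10^{\beta w}}{\beta} \right] \]

The compute budget enters only through the additive constant \(E\) and the positive scale \(K(C)\). A least-squares parabola fit is linear in the loss values, so rescaling the curve rescales the fitted \(p\) and \(q\) by the same factor and leaves the vertex \(-q/(2p)\) where it was. The bracketed shape depends only on \(\alpha\), \(\beta\), and the sampled values of \(w\), and it is an even function of \(w\) exactly when \(\alpha = \beta\).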

Because the IsoFLOP parabola is fit in \(\log N\) space (as described in the Approach 2 procedure), the vertex shift directly biases \(N^*\). Since \(C = 6ND\), analyzing the bias in either \(N^*\) or \(D^*\) is sufficient; we focus on \(N^*\) here since that is where the parabolic fit typically operates.

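Since the vertex shift is constant across all compute budgets, it biases every inferred \(N^*\) by the same multiplicative factor. When fitting \(\log N^*\) vs \(\log C\) to extract scaling exponents, that factor becomes a constant additive offset in log-space: the slope (the exponent) is unchanged, and the intercept absorbs the entire bias.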

Exact derivation: The intercept error can be derived analytically in closed form. The parabola vertex shifts by \(\delta w\) (in log-space), giving an intercept error of:

\[ \text{Intercept error} = 10^{\delta w} - 1 \]

where \(\delta w = f(\alpha, \beta, W, n)\) depends only on the surface exponents and the sampling grid (width \(W\) in log-space, number of points \(n\) per IsoFLOP curve), not on \(C\), \(E\), \(A\), or \(B\). A grid of width \(W\) spans \(10^{-W/2}\) to \(10^{W/2}\) times the optimal \(N^*\), so \(W = 2.41\) (the XL grid) means sampling from \(\frac{1}{16}\times\) to \(16\times\) the optimum, and \(n = 10\) means 10 model sizes per compute budget. Key properties: the shift vanishes when \(\alpha = \beta\), it grows with the grid width \(W\), and it is the same at every compute budget.

For example, with the Chinchilla parameters (\(\alpha = 0.34\), \(\beta = 0.28\)): the XS grid (\(W = 0.60\)) yields 0.3% intercept error, while the XL grid (\(W = 2.41\)) yields 4.1% error.

The full derivation provides the closed-form expression for vertex shift \(\delta w\) as a function of \(\alpha\), \(\beta\), \(W\), and \(n\). It also shows how this shift translates directly into intercept error, independent of compute budget.

Intuition via Taylor expansion: Fitting a parabola implicitly treats the IsoFLOP curve as its own 2nd-order Taylor expansion around the optimum. The approximation \(L(w) \approx L(0) + \frac{1}{2}L''(0)w^2\) is only valid when higher-order terms are negligible, i.e., when samples are close to the true minimum. As the sampling range increases, 3rd and 4th order terms grow. For symmetric surfaces (\(\alpha = \beta\)), odd-order terms cancel by symmetry, preserving the vertex location. For asymmetric surfaces, they don't cancel, shifting the fitted vertex away from the true optimum.
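The dependence on grid width can be checked numerically. The sketch below relies on the fact that, along an IsoFLOP curve with \(w = \log_{10}(N/N^*)\), the loss equals a constant plus a positive multiple of \(10^{-\alpha w}/\alpha + 10^{\beta w}/\beta\) (this follows from the first-order optimality condition), so the fitted vertex depends only on \(\alpha\), \(\beta\), and the grid. The 15 points per curve is an assumption, and the helper names are hypothetical.

```python
import math


def parabola_vertex(xs, ys):
    """Vertex of the least-squares parabola through (xs, ys).

    Assumes xs are evenly spaced, so odd moments about their mean vanish
    and the normal equations reduce to the closed form below.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    u = [x - x_mean for x in xs]
    s2 = sum(v ** 2 for v in u)
    s4 = sum(v ** 4 for v in u)
    q = sum(v * y for v, y in zip(u, ys)) / s2
    p = (n * sum(v * v * y for v, y in zip(u, ys)) - s2 * sum(ys)) / (n * s4 - s2 ** 2)
    return x_mean - q / (2.0 * p)


def vertex_shift(alpha, beta, k, n_points=15):
    """Shift of the fitted vertex in log10(N) for a ±k× grid centered at the optimum."""
    h = math.log10(k)
    ws = [-h + 2.0 * h * i / (n_points - 1) for i in range(n_points)]
    # IsoFLOP loss up to an additive constant and a positive scale factor
    shape = [10 ** (-alpha * w) / alpha + 10 ** (beta * w) / beta for w in ws]
    return parabola_vertex(ws, shape)


for name, k in [("XS", 2), ("S", 4), ("L", 8), ("XL", 16)]:
    dw = vertex_shift(0.34, 0.28, k)
    # A positive shift in log10(N) overestimates N* and, since D = C/(6N),
    # underestimates D* by the factor 10^(-dw).
    print(f"{name}: shift in log10(N) = {dw:+.5f}, D* error = {10 ** -dw - 1:+.2%}")
```

With equal exponents the same function returns a shift at the level of floating-point rounding, which matches the symmetry argument above.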

Why It Matters

Extrapolation to higher compute budgets requires both exponents and intercepts to be correct. The previous section established that asymmetric loss surfaces produce provably biased intercepts even under ideal experimental conditions. Here we quantify what those errors mean in practical terms by examining compute-optimal token prediction: given a compute budget, how many tokens does the inferred scaling law predict?
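To make the link between intercept bias and token counts concrete, the sketch below plugs the Chinchilla-surface values tabulated earlier (exponent, true intercept, and Approach 2 intercept) into the fitted power law \(\log_{10} D^* = b_0 + b \log_{10} C\). The function name is hypothetical; the numbers are copied from this article's tables.

```python
import math

b = 0.548387          # D* exponent (recovered exactly by Approach 2)
b0_true = -0.555357   # true log10 intercept
b0_hat = -0.578092    # Approach 2 log10 intercept on the same surface


def predict_tokens(C, intercept, exponent):
    """Compute-optimal token count from a log-log power-law fit."""
    return 10 ** (intercept + exponent * math.log10(C))


C = 1e24
d_true = predict_tokens(C, b0_true, b)
d_hat = predict_tokens(C, b0_hat, b)
rel_error = d_hat / d_true - 1.0

print(f"true D*     = {d_true:.3e}")
print(f"inferred D* = {d_hat:.3e}")
print(f"relative error = {rel_error:+.2%}")
```

Because the exponent is identical in both fits, the relative error equals \(10^{\hat{b}_0 - b_0} - 1\) and is the same at every compute budget; only the absolute token shortfall grows with \(C\).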

Up to this point, all analysis has assumed a single fixed sampling grid width. We now examine how token prediction error varies with both compute budget and sampling grid width. For surfaces with asymmetric exponents, wider sampling grids amplify the parabola-fitting mismatch, increasing the constant vertex shift and thus the intercept bias. To make this comparison concrete, we first define what "wider" and "narrower" mean in quantitative terms.

A sampling grid of "±kx" means the sampled values (whether model sizes or token counts) range from 1/k to k times the true optimum at each compute budget. The total range covered is k² (the ratio of largest to smallest), and the log₁₀ of that ratio tells you how many factors of 10, or "decades," the grid spans end-to-end (e.g. a value of 1.81 means the largest sample is \(10^{1.81} \approx 64\)x the smallest). The table below shows the four grid widths used in this analysis:

| Grid Name | ±kx | Sampling Range | Total Ratio | Decade Span (factors of 10) |
|---|---|---|---|---|
| Extra Small (XS) | ±2x | 1/2x to 2x | 4x | 0.60 |
| Small (S) | ±4x | 1/4x to 4x | 16x | 1.20 |
| Large (L) | ±8x | 1/8x to 8x | 64x | 1.81 |
| Extra Large (XL) | ±16x | 1/16x to 16x | 256x | 2.41 |

In practice, scaling law experiments typically sample across 1 to 2 decades in token count, placing the Small and Large grids squarely within the realistic range. The Extra Small and Extra Large grids bracket this range on either side, illustrating how the biases shrink or grow as the sampling window narrows or widens. The Extra Large grid (±16x, ~2.4 decades) is the default used in all single-grid analyses in the preceding sections.
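The grid definitions reduce to a few lines of arithmetic. The sketch below reproduces the total ratio and decade span for each ±kx grid and generates log-spaced sample points around a given center; the function names and the example center are hypothetical.

```python
import math

GRIDS = {"XS": 2, "S": 4, "L": 8, "XL": 16}


def grid_summary(k):
    """Total ratio (largest/smallest sample) and decade span of a ±k× grid."""
    total_ratio = k ** 2
    return total_ratio, math.log10(total_ratio)


def sample_points(center, k, n_points):
    """n_points log-spaced values from center/k to center*k."""
    h = math.log10(k)
    return [center * 10 ** (-h + 2.0 * h * i / (n_points - 1)) for i in range(n_points)]


for name, k in GRIDS.items():
    ratio, span = grid_summary(k)
    print(f"{name}: ±{k}x, total ratio {ratio}x, {span:.2f} decades")

# Example: five token counts on the S grid around a center of 1e9 tokens
print(sample_points(1e9, GRIDS["S"], 5))
```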

Bar chart showing token prediction error by surface and grid width
Figure 3: Relative error in compute-optimal token prediction when extrapolating from the training range (10¹⁷-10²¹ FLOPs) to 10²⁴ FLOPs. Negative values indicate underestimation: the inferred scaling law predicts fewer tokens than optimal. Bars are grouped by sampling grid width. Annotations for the Chinchilla surface show \(D^*\) (true compute-optimal token count) versus \(\hat{D}^*\) (the Approach 2 estimate).
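
Raw data for Figure 3:

Symmetric Surface (α = β)

| Surface | α | β | Grid | True D* | Inferred D* | Abs Error | Rel Error |
|---|---|---|---|---|---|---|---|
| Symmetric | 0.31 | 0.31 | XS (±2×) | 408.2B | 408.2B | ≈0 | ≈0% |
| Symmetric | 0.31 | 0.31 | S (±4×) | 408.2B | 408.2B | ≈0 | ≈0% |
| Symmetric | 0.31 | 0.31 | L (±8×) | 408.2B | 408.2B | ≈0 | ≈0% |
| Symmetric | 0.31 | 0.31 | XL (±16×) | 408.2B | 408.2B | ≈0 | ≈0% |

Chinchilla Surface (α ≠ β)

| Surface | α | β | Grid | True D* | Inferred D* | Abs Error | Rel Error |
|---|---|---|---|---|---|---|---|
| Chinchilla | 0.34 | 0.28 | XS (±2×) | 4.04T | 4.02T | −13.2B | −0.33% |
| Chinchilla | 0.34 | 0.28 | S (±4×) | 4.04T | 3.98T | −52.5B | −1.30% |
| Chinchilla | 0.34 | 0.28 | L (±8×) | 4.04T | 3.92T | −117.2B | −2.90% |
| Chinchilla | 0.34 | 0.28 | XL (±16×) | 4.04T | 3.83T | −205.8B | −5.10% |

Asymmetric Surface (α/β = 3)

| Surface | α | β | Grid | True D* | Inferred D* | Abs Error | Rel Error |
|---|---|---|---|---|---|---|---|
| Asymmetric | 0.465 | 0.155 | XS (±2×) | 45.1Q | 44.3Q | −755.4T | −1.67% |
| Asymmetric | 0.465 | 0.155 | S (±4×) | 45.1Q | 42.2Q | −2.9Q | −6.50% |
| Asymmetric | 0.465 | 0.155 | L (±8×) | 45.1Q | 38.8Q | −6.3Q | −13.91% |
| Asymmetric | 0.465 | 0.155 | XL (±16×) | 45.1Q | 34.7Q | −10.4Q | −23.12% |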

B = billion, T = trillion, Q = quadrillion. Training range: 10¹⁷–10²¹ FLOPs. Evaluation budget: 10²⁴ FLOPs.

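The key observations from this figure are that the symmetric surface shows no error at any grid width, that both asymmetric surfaces underestimate \(D^*\) by an amount that grows with grid width (from −0.33% at XS to −5.10% at XL on the Chinchilla surface, and from −1.67% to −23.12% on the Asymmetric surface), and that at any given grid width the error grows with the degree of asymmetry.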

✓ Key Result

Consider the Chinchilla surface with the Large grid (±8x), a practical sampling range for real experiments. When extrapolating to 10²⁴ FLOPs, the true optimal token count is 4.04 trillion, but Approach 2 predicts only 3.92 trillion: a 2.9% underestimate, or roughly 117 billion fewer tokens than optimal. While 2.9% may seem modest, recall that this simulation uses unrealistically ideal conditions: perfectly centered sampling grids at every compute budget and zero measurement noise. Real experiments, where the true optimum is unknown, data is noisy, and the scaling exponent imbalance may be larger than Chinchilla's modest \(\alpha/\beta \approx 1.2\), can only do worse.

Off-Center Sampling: Exponent and Extrapolation Errors

The previous sections assumed perfectly centered sampling: at every compute budget, the IsoFLOP grid was placed exactly at the true optimum. In practice, you don't know \(N^*\) before running the experiment. Sampling centers are guesses, informed by prior estimates or heuristics, and they will inevitably be wrong by some amount.

This is a distinct source of error from the asymmetry bias examined earlier. Asymmetry errors arise from the shape of the loss surface (\(\alpha \neq \beta\)); off-center errors arise from where you place the sampling grid. To isolate this new effect, we return to the symmetric surface (\(\alpha = \beta = 0.31\)) where asymmetry bias is zero by construction.

Constant Multiplicative Bias

The simplest form of off-center sampling is a constant multiplicative offset: every compute budget's sampling center is shifted by the same factor from the true optimum. A "3× offset" means each IsoFLOP grid is centered at \(3 \times D^*\) instead of \(D^*\), so the grid midpoint consistently sits at three times the true optimal token count.

Because this offset is the same at every compute budget, it has a familiar geometric effect: each parabola vertex shifts by a constant amount in log-space. This is the same mechanism as asymmetry bias. The slope of \(\log D^*\) vs \(\log C\) is unaffected (a constant additive shift in log-space doesn't change the slope), so the scaling exponent is preserved perfectly. The intercept, however, absorbs the entire error.
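Written out, with \(\delta\) denoting the constant vertex shift in log-space (a symbol introduced here for illustration) and \(b_0\), \(b\) the true intercept and exponent of the \(D^*\) power law:

\[ \log_{10} \hat{D}^*(C) = \log_{10} D^*(C) + \delta = (b_0 + \delta) + b \log_{10} C \]

A linear fit to \(\log_{10} \hat{D}^*\) against \(\log_{10} C\) therefore returns slope \(b\) and intercept \(b_0 + \delta\), and the predicted token count is off by the factor \(10^{\delta}\), a relative error of \(10^{\delta} - 1\), at every compute budget.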

Off-center sampling with constant multiplicative bias showing zero exponent error but systematic intercept error
Figure 4: Effect of a constant 3× offset in sampling centers on the symmetric surface. Top left: IsoFLOP curves at the Large grid (±8×), with black diamonds marking the (off-center) sampling center, red × the true \(D^*\), and blue + the inferred \(D^*\). Top right: extrapolation error in compute-optimal token prediction at 10²⁴ FLOPs for each grid width, using the same XS through XL grids defined earlier. Bottom row: exponent and intercept errors across grid widths from XS (±2×) to XL (±16×), plotted on the same y-axis scale. The exponent is recovered perfectly (flat at zero) while the intercept shows systematic bias that varies with grid width.

The extrapolation bar chart (top right) shows what this means for token prediction: all four grid widths overestimate \(D^*\), with the narrowest grid (XS) producing the largest error. This is the reverse of the asymmetry bias pattern, where wider grids amplified error. Here, narrower grids are more sensitive to off-center placement because fewer samples lie near the true optimum.

The intercept error panel (bottom right) confirms the pattern across the full continuum of grid widths. The error is always positive (the inferred \(D^*\) overshoots) and decreases monotonically as the grid widens, reflecting how a wider sampling range brings more of the true loss curve's shape into the fit, partially compensating for the misplaced center.

✓ Key Result

Consider the symmetric surface with the Large grid (±8×) and a 3× offset, where every IsoFLOP grid is centered at three times the true optimal token count. When extrapolating to 10²⁴ FLOPs, the true optimal token count is 408.2 billion, but Approach 2 predicts 419.0 billion: a 2.6% overestimate, roughly 10.8 billion more tokens than optimal. Compare this with the Chinchilla asymmetry result at the same grid width: a 2.9% underestimate. The magnitudes are comparable, but the sources are entirely different. Asymmetry bias comes from the shape of the loss surface; off-center bias comes from where you place the grid. In a real experiment, both act simultaneously.

Drifting Bias

When the offset varies with compute budget, a qualitatively different failure mode emerges. To illustrate this, we apply a linear drift: the sampling center starts at the true optimum for the lowest budget and drifts to 3× the true optimum at the highest budget, interpolating linearly in log-compute space.

Because the offset now differs across compute budgets, it no longer cancels in the slope of \(\log D^*\) vs \(\log C\). Both the exponent and the intercept are affected.
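The contrast between a constant and a drifting offset can be reproduced with a short simulation on the symmetric surface. This is a sketch under stated assumptions: 15 points per IsoFLOP curve, the Large (±8×) grid, and hypothetical helper names. The two offset functions implement the 3× constant offset and the linear-in-log-compute drift described in the text.

```python
import math

E, A, B, ALPHA, BETA = 1.69, 400.0, 400.0, 0.31, 0.31  # symmetric surface
BUDGETS = [1e17, 1e18, 1e19, 1e20, 1e21]


def loss(N, D):
    return E + A / N ** ALPHA + B / D ** BETA


def true_log10_d_opt(C):
    # With alpha == beta and A == B, the optimum is D* = (C/6)^0.5
    return 0.5 * math.log10(C / 6.0)


def parabola_vertex(xs, ys):
    """Least-squares parabola vertex; assumes evenly spaced xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    u = [x - x_mean for x in xs]
    s2 = sum(v ** 2 for v in u)
    s4 = sum(v ** 4 for v in u)
    q = sum(v * y for v, y in zip(u, ys)) / s2
    p = (n * sum(v * v * y for v, y in zip(u, ys)) - s2 * sum(ys)) / (n * s4 - s2 ** 2)
    return x_mean - q / (2.0 * p)


def linear_fit(xs, ys):
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return slope, y_mean - slope * x_mean


def constant_offset(C):
    return math.log10(3.0)  # grid centered at 3x the true D* for every budget


def drifting_offset(C):
    # 0 at 1e17 FLOPs, rising linearly in log-compute to log10(3) at 1e21 FLOPs
    return math.log10(3.0) * (math.log10(C) - 17.0) / 4.0


def fit_tokens(offset_fn, k=8.0, n_points=15):
    """Approach 2 in log10(D), with the grid center shifted by offset_fn(C)."""
    h = math.log10(k)
    log_c, log_d_hat = [], []
    for C in BUDGETS:
        center = true_log10_d_opt(C) + offset_fn(C)
        log_ds = [center - h + 2.0 * h * i / (n_points - 1) for i in range(n_points)]
        losses = [loss(C / (6.0 * 10 ** ld), 10 ** ld) for ld in log_ds]
        log_c.append(math.log10(C))
        log_d_hat.append(parabola_vertex(log_ds, losses))
    return linear_fit(log_c, log_d_hat)


b_const, b0_const = fit_tokens(constant_offset)
b_drift, b0_drift = fit_tokens(drifting_offset)
print(f"constant offset: b = {b_const:.6f}, intercept = {b0_const:.6f}")
print(f"drifting offset: b = {b_drift:.6f}, intercept = {b0_drift:.6f}")
```

The true values on this surface are \(b = 0.5\) and an intercept of \(-0.5 \log_{10} 6 \approx -0.389\). In this sketch the constant offset should leave the exponent untouched while moving the intercept, and the drifting offset should move both.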

Off-center sampling with drifting bias showing both exponent and intercept errors
Figure 5: Effect of a linear drift in sampling centers (centered at true optimum for lowest budget, drifting to 3× at highest budget) on the symmetric surface. Unlike the constant bias case, the exponent error (bottom left) is now non-zero: the slope of \(\log D^*\) vs \(\log C\) is distorted because the offset varies across compute budgets.

Compare the bottom-left panels of Figures 4 and 5: constant bias produces a flat line at zero (exponent preserved), while drifting bias produces a non-zero exponent error that varies with grid width.

✓ Key Message

Constant bias preserves exponents; any compute-dependent bias pattern distorts them. The distinction matters because exponent errors compound during extrapolation, while intercept errors remain fixed.

IsoFLOP Curves in the Wild: Evidence from Published Studies

The previous sections used synthetic, noise-free simulations to isolate Approach 2's biases under controlled conditions. A natural question is whether the conditions that trigger these biases, asymmetric loss surfaces and imperfectly centered sampling, actually arise in practice. To get a sense of this, we can look at IsoFLOP curves published in three of the most prominent scaling law studies[1],[2],[3].

IsoFLOP curves from Chinchilla, Llama 3, and DeepSeek scaling law papers
Figure 6: IsoFLOP curves from three published scaling law studies. Left: Chinchilla (training loss vs parameters). Center: Llama 3 (validation loss vs training tokens). Right: DeepSeek (bits-per-byte vs FLOPs/token). Each panel shows curves at multiple compute budgets, fit using Approach 2.

Several features relevant to the biases studied in this article are visible across all three panels:

To be clear, this is not a criticism of these studies. These are among the most careful and influential scaling law analyses published. The point is a more general one: the conditions under which Approach 2's biases activate, asymmetric surfaces and imperfect sampling centers, appear to be the norm rather than the exception. The idealized conditions of the Happy Path (symmetric surface, perfectly centered grids) are the special case.

Compounding Errors

Given evidence that both surface asymmetry and off-center sampling are present in real studies, we can simulate what happens when these biases act simultaneously. Using the same three loss surfaces and five sampling bias configurations from earlier sections, we fit Approach 2 on compute budgets from 10¹⁷ to 10²¹ FLOPs and extrapolate \(D^*\) predictions out to 10²⁵ FLOPs. Results are shown at two grid widths that bracket the realistic range: XS (±2×) and XL (±16×).

Combined extrapolation error across grid widths, surfaces, and sampling biases
Figure 7: Relative error in \(D^*\) when extrapolating beyond the fitting range, with asymmetry and sampling biases acting simultaneously. Rows = XS and XL grids; columns = symmetric, Chinchilla, and Asymmetric surfaces; one curve per bias configuration (baseline, drift, constant offset).

At narrow grids (XS), the parabolic approximation is tight and asymmetry bias is negligible; errors are dominated by sampling biases, with drift creating errors that grow with extrapolation distance (reaching roughly 25% on the Asymmetric surface and 8% on the Chinchilla surface at 10²⁵ FLOPs). At wider grids (XL), asymmetry bias dominates: the perfect-sampling baseline underestimates \(D^*\) by about 5% on the Chinchilla surface and about 20% on the Asymmetric surface. In these particular configurations, the sampling biases happen to partially offset the asymmetry error, but this is not guaranteed: the direction of each bias source depends on the sign and magnitude of the offset, and the two can just as easily compound as cancel.

✓ Key Result

Multiple bias sources act simultaneously in any real experiment. Even individually, asymmetry and sampling biases each produce meaningful errors; when they happen to align, the combined error can exceed either one alone. At practical grid widths with Chinchilla-like asymmetry, errors of 5% or more in \(D^*\) are typical, and on more asymmetric surfaces the errors reach 20% or more.

Robust Fits: Unbiased Estimation with Linear Separation

The previous sections showed that Approach 2's parabolic approximation introduces systematic biases in intercepts (from asymmetry) and potentially exponents (from off-center sampling), and that the conditions driving these biases are visible in published scaling law studies. The natural alternative is Approach 3, which fits all five surface parameters \((E, A, B, \alpha, \beta)\) simultaneously via nonlinear least squares. This avoids the parabolic approximation entirely but brings its own set of problems.

Problems with Direct Surface Fitting

A recent survey of over 50 scaling law papers[13] documents the landscape of fitting practices and their failure modes. The problems described below apply to scaling law fitting in general, not just Chinchilla forms, but they are directly relevant because Approach 3 involves the same kind of nonlinear optimization. Over half of the papers surveyed do not fully specify their fitting procedure (optimizer, loss function, or initialization), which compounds reproducibility challenges.

The most common optimizers for scaling law fits are BFGS and L-BFGS. Some studies use SGD-family optimizers like Adam and Adagrad, though these are noted as sometimes poorly suited for curve fitting due to limited data efficiency. At least one study[14] forgoes optimization entirely in favor of pure grid search because fitted solutions are too unstable.

In practice, this instability takes several forms. Results are sensitive to initialization: different starting points for the optimizer can lead to substantially different fitted parameters. Results are also sensitive to optimizer hyperparameters such as convergence tolerance and gradient estimation method. And the optimizer frequently converges to local minima rather than the global optimum.

Initialization is the most studied source of variability. Common mitigations include grid search over thousands of starting points (running the optimizer from each and keeping the best fit), random sampling of starting points, evaluating a coarse grid without optimization and seeding the optimizer from the single best candidate, or initializing from previously published parameter values. None of these reliably solve the problem. The survey's own experiments show that full-grid optimization over 4500 starting points sometimes yields the worst fit among all strategies tested, evidence of "the difficulty of optimizing over this space, and the presence of many local minima."

A simpler alternative is to log-linearize the power law and fit with linear regression. However, the log transformation changes the error distribution and exaggerates errors at small loss values, biasing parameter estimates. This bias is easily observed in simulations like ours. The survey also finds that the choice of loss function (whether Log-Huber, Huber, MSE, or MAE) affects fitted parameters unpredictably across datasets, and non-MSE objectives can introduce systematic bias in parameter estimates. Our goal is to identify a fitting method that is simple, stable, and efficient rather than to address outliers or other statistical concerns, so we use MSE for all fits in this article.

The survey's experimental analysis varies optimizer, loss function, and initialization strategy across three datasets. The overarching finding is that none of these choices reliably eliminates instability, and results shift unpredictably between datasets. A key contributor is the high dimensionality of the joint five-parameter optimization, which creates a complex loss landscape with many local minima and interacting sensitivities. Reducing the dimensionality of the nonlinear search is one way to make the problem more tractable.

Variable Projection (VPNLS)

The Chinchilla loss surface has a partially linear structure that can be exploited. For any fixed values of \(\alpha\) and \(\beta\), the remaining parameters \((E, A, B)\) enter the model linearly and can be solved exactly via least squares. This is the same computational shortcut that motivates Approach 2 (optimizing exponential terms separately from linear terms), but applied here without the parabolic approximation.

The algorithm searches over \((\alpha, \beta)\) and, at each candidate pair, solves for \((E, A, B)\) via non-negative least squares (NNLS). A coarse 32×32 grid search identifies a good starting region, and a Nelder-Mead simplex optimizer refines it. The linear separation is maintained throughout: the optimizer only ever navigates the two-dimensional \((\alpha, \beta)\) surface, never the full five-parameter space. We call this method VPNLS (Variable Projection with Non-negative Least Squares).

function VPNLS(data):

    function objective(α, β):
        X ← [1, N^(-α), D^(-β)]         // design matrix, one row per observation
        (E, A, B) ← NNLS(X, L)          // linear solve with E, A, B ≥ 0
        return ‖L − X·[E, A, B]‖²

    (α₀, β₀) ← argmin objective(α, β)  // coarse 32×32 grid search
    (α*, β*) ← NelderMead(objective,    // refine in 2D only
                           start=(α₀, β₀))
    (E*, A*, B*) ← NNLS(X(α*, β*), L)  // recover linear params at solution

    return (E*, A*, B*, α*, β*)

The choice of Nelder-Mead over L-BFGS-B is deliberate. VPNLS uses NNLS for the inner solve to guarantee that \(E\), \(A\), and \(B\) remain non-negative, preventing physically meaningless fits. However, NNLS has no closed-form gradient with respect to the outer parameters \((\alpha, \beta)\). Switching to ordinary least squares would restore differentiability but cannot enforce non-negativity. With NNLS, L-BFGS-B must rely on finite-difference gradients, which creates a set of interacting tuning parameters (eps, jac, ftol, gtol, maxcor, maxls) where tight tolerances demand gradient accuracy that finite differences cannot reliably provide.

Nelder-Mead avoids this entirely. Its few settings (xatol, fatol) are independent and work well out of the box. Nelder-Mead scales poorly to high dimensions, but variable projection reduces the search to just two dimensions, which is exactly the regime where simplex methods excel.
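The pseudocode above maps fairly directly onto NumPy and SciPy. The following is a minimal sketch, not the authors' implementation: the grid range for \((\alpha, \beta)\), the Nelder-Mead tolerances, and the synthetic test data (Chinchilla-like values \(E = 1.69\), \(A = 406.4\), \(B = 410.7\), sampled on ±8× IsoFLOP grids with 9 points each) are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import minimize, nnls


def vpnls(N, D, L, grid=np.linspace(0.05, 0.95, 32)):
    """Fit L = E + A*N^-alpha + B*D^-beta, searching over (alpha, beta) only."""
    N, D, L = (np.asarray(v, dtype=float) for v in (N, D, L))

    def design(alpha, beta):
        # One row per observation; columns multiply E, A, and B respectively
        return np.column_stack([np.ones_like(N), N ** -alpha, D ** -beta])

    def objective(theta):
        alpha, beta = theta
        if alpha <= 0.0 or beta <= 0.0:
            return np.inf  # keep the simplex in the valid region
        _, rnorm = nnls(design(alpha, beta), L)  # inner solve with E, A, B >= 0
        return rnorm ** 2  # residual sum of squares

    # Coarse 32x32 grid search for a starting point
    start = min(((a, b) for a in grid for b in grid), key=objective)
    # Refine in two dimensions only
    result = minimize(objective, start, method="Nelder-Mead",
                      options={"xatol": 1e-10, "fatol": 1e-20, "maxiter": 5000})
    alpha, beta = result.x
    (E, A, B), _ = nnls(design(alpha, beta), L)  # recover linear parameters
    return E, A, B, alpha, beta


# Noise-free synthetic data on IsoFLOP grids centered at the true optimum
TRUE = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)
G = (TRUE["alpha"] * TRUE["A"] / (TRUE["beta"] * TRUE["B"])) ** (
    1.0 / (TRUE["alpha"] + TRUE["beta"]))
a = TRUE["beta"] / (TRUE["alpha"] + TRUE["beta"])
N_parts, D_parts = [], []
for C in [1e17, 1e18, 1e19, 1e20, 1e21]:
    n = G * (C / 6.0) ** a * np.logspace(-np.log10(8.0), np.log10(8.0), 9)
    N_parts.append(n)
    D_parts.append(C / (6.0 * n))
N_obs, D_obs = np.concatenate(N_parts), np.concatenate(D_parts)
L_obs = TRUE["E"] + TRUE["A"] * N_obs ** -TRUE["alpha"] + TRUE["B"] * D_obs ** -TRUE["beta"]

E_hat, A_hat, B_hat, alpha_hat, beta_hat = vpnls(N_obs, D_obs, L_obs)
print(E_hat, A_hat, B_hat, alpha_hat, beta_hat)
```

`scipy.optimize.nnls` returns the solution together with the residual norm, so the outer objective needs no extra matrix product. The only search the simplex ever performs is over the two exponents.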

Method Comparison

To validate this choice, we compare nine method configurations on noise-free synthetic data across three loss surfaces (symmetric, Chinchilla, and high imbalance) and 20 sampling ranges. This is the best case for gradient-based methods since the data contains no noise that could obscure gradient information.

The configurations fall into two groups. The first uses 5D direct optimization (Approach 3), fitting all five parameters jointly with L-BFGS-B using either analytical gradients, forward finite differences, or central finite differences. The second uses 2D variable projection over \((\alpha, \beta)\) only, comparing VPNLS (Nelder-Mead), L-BFGS-B with four finite-difference configurations (default \(\varepsilon\), central differences, \(\varepsilon = 10^{-6}\), and \(\varepsilon = 10^{-10}\)), and a fine 256² grid search with no local refinement.

Method comparison showing geometric mean error and max error across nine optimizer configurations
Figure 8: Comparison of nine fitting methods on noise-free synthetic data across three loss surfaces and 20 sampling ranges (60 fits total per method). Left: geometric mean of |relative error| (%) pooled across all surfaces, grid widths, and parameters, with horizontal bars spanning the min-to-max range. Filled dots indicate convergence on all 60 fits; open dots indicate at least one failure (count annotated). Right: maximum |relative error| (%) per parameter over successful fits, on a log-scale colormap. Methods are sorted by geometric mean error, with the worst at top.

In the left panel, each dot shows the typical (geometric mean) parameter recovery error for one method, and the horizontal bar shows the range from best to worst case across 60 scenarios. The right panel breaks this down by parameter, showing the worst-case error for each.

Consider the best Approach 3 configuration (5D L-BFGS-B with analytical gradients). Even with exact gradients on noise-free data, the worst-case errors reach about 5% for the scaling exponent \(\alpha\), about 2% for the irreducible loss \(E\), and about 30% for the coefficient \(A\). While a few percent in an exponent may appear modest, the preceding sections show that errors of this magnitude in scaling parameters translate into meaningful distortions when extrapolating compute-optimal predictions to higher budgets. VPNLS recovers all five parameters with errors on the order of 10⁻⁸%, effectively eliminating parameter estimation as a source of extrapolation error. Figure A1 breaks this down by surface and sampling range, also revealing that Approach 3's errors can vary systematically with sampling range on certain surfaces.
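To give a rough sense of scale, the sketch below propagates a 5% error in \(\alpha\) through the compute-optimal exponent \(a = \beta / (\alpha + \beta)\), where \(N_{\text{opt}} \propto C^{a}\) for a surface of the form \(E + A N^{-\alpha} + B D^{-\beta}\). It holds \(\beta\) and the prefactor fixed, so it isolates the exponent effect only. The Chinchilla exponents and the four-decade extrapolation distance are illustrative assumptions.

```python
def n_opt_exponent(alpha, beta):
    """Exponent a such that N_opt is proportional to C**a."""
    return beta / (alpha + beta)

alpha_true, beta_true = 0.34, 0.28   # illustrative Chinchilla exponents
alpha_est = alpha_true * 1.05        # 5% overestimate of alpha

a_true = n_opt_exponent(alpha_true, beta_true)
a_est = n_opt_exponent(alpha_est, beta_true)

decades = 4                          # hypothetical: extrapolate 10,000x in compute
ratio = 10 ** (decades * (a_est - a_true))

print(f"true exponent      : {a_true:.4f}")
print(f"estimated exponent : {a_est:.4f}")
print(f"predicted / true N_opt after {decades} decades: {ratio:.3f}")
```

Under these assumptions the predicted optimal parameter count lands roughly 10% below the true optimum, and the gap widens with every additional decade of extrapolation.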

Looking at the full set of methods, a clear hierarchy emerges. High-resolution grid search (256²) is stable across all conditions but provides the poorest overall precision among 2D methods, limited by grid resolution.

5D direct optimization (Approach 3) is more accurate on average than grid search but highly variable across conditions. The 5D configurations that rely on finite-difference gradients rather than analytical gradients perform particularly poorly and serve as a useful negative control. They demonstrate what high variability and instability look like, and Approach 3 with analytical gradients exhibits a similar pattern at somewhat lower magnitude. The full per-parameter breakdown (Figure A1) shows these instability patterns in detail.

L-BFGS-B with 2D variable projection can match VPNLS precision, but three of its four configurations fail to converge on at least one of the 60 fits (1 failure with central differences, 3 with \(\varepsilon = 10^{-10}\), and 20 with \(\varepsilon = 10^{-6}\)), even in this relatively small test suite. The choice of finite-difference scheme matters considerably. Switching from forward to 3-point central differences closes the precision gap with Nelder-Mead (from roughly 10⁻⁵% to 10⁻⁸% error), but introduces sporadic line search failures. Notably, these failures can be false positives. The optimizer has already reached the true minimum, with residual sum of squares near machine zero, but the line search cannot verify further progress because function values are too small to distinguish. In scipy, this surfaces as result.success = False with an ABNORMAL status from scipy.optimize.minimize, even though the returned parameters are correct.
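One way to separate these false positives from genuine failures on noise-free data is to inspect the objective value before discarding a fit. The sketch below is a hypothetical helper, not part of scipy or of this article's code: `accept_fit` and its `rss_tol` threshold are assumptions, and the two `OptimizeResult` objects are hand-built mock-ups rather than real optimizer output. The check only makes sense when the true residual sum of squares is zero; with noisy data a different criterion would be needed.

```python
import numpy as np
from scipy.optimize import OptimizeResult

def accept_fit(result, rss_tol=1e-20):
    """Accept a fit if scipy reports success, or if the RSS is already ~0.

    `rss_tol` is a hypothetical threshold for noise-free data only.
    """
    if result.success:
        return True
    return bool(np.isfinite(result.fun) and result.fun <= rss_tol)

# Mock result: line search gave up, but the objective is at machine zero.
false_positive = OptimizeResult(x=np.array([0.34, 0.28]), fun=1e-28,
                                success=False, message="ABNORMAL")

# Mock result: line search gave up far from the minimum.
real_failure = OptimizeResult(x=np.array([0.51, 0.12]), fun=3e-4,
                              success=False, message="ABNORMAL")

for label, res in [("false positive", false_positive),
                   ("real failure", real_failure)]:
    print(f"{label}: success={res.success}, rss={res.fun:.1e}, "
          f"accepted={accept_fit(res)}")
```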

L-BFGS-B remains a viable alternative to Nelder-Mead for practitioners who are willing to tune its settings carefully and who understand that certain convergence failures reported by scipy are not necessarily problems. That said, VPNLS with Nelder-Mead is simpler, requires less tuning, converged on all 60 fits, and recovers parameter estimates with precision at least as high as any other method tested. It technically achieves the most precise estimates, though the margin over a well-configured L-BFGS-B with 3-point central differences is small.

| Method | Failures | Max E err% | Max A err% | Max B err% | Max α err% | Max β err% |
|---|---|---|---|---|---|---|
| 2D Nelder-Mead (VPNLS) | 0/60 | 5.2×10⁻⁸ | 6.3×10⁻⁸ | 7.9×10⁻⁸ | 1.2×10⁻⁸ | 2.0×10⁻⁸ |
| 2D L-BFGS-B (central diff) | 1/60 | 8.3×10⁻⁸ | 5.3×10⁻⁸ | 6.4×10⁻⁸ | 1.2×10⁻⁸ | 2.0×10⁻⁸ |
| 2D L-BFGS-B (default ε) | 0/60 | 1.6×10⁻⁵ | 1.0×10⁻⁵ | 1.3×10⁻⁵ | 2.1×10⁻⁶ | 3.9×10⁻⁶ |
| 2D L-BFGS-B (ε=10⁻¹⁰) | 3/60 | 1.6×10⁻⁷ | 8.9×10⁻⁷ | 8.6×10⁻⁷ | 1.8×10⁻⁷ | 1.7×10⁻⁷ |
| 2D L-BFGS-B (ε=10⁻⁶) | 20/60 | 1.2×10⁻³ | 1.1×10⁻³ | 1.3×10⁻³ | 2.2×10⁻⁴ | 3.8×10⁻⁴ |
| 2D Grid (256²) | 0/60 | 2.58 | 2.03 | 2.03 | 0.44 | 0.57 |
| 5D L-BFGS-B (analytical) | 0/60 | 2.23 | 29.8 | 6.14 | 5.03 | 1.33 |
| 5D L-BFGS-B (central diff) | 1/60 | 103 | 2,334 | 343 | 78.9 | 28.8 |
| 5D L-BFGS-B (finite diff) | 2/60 | 113 | 2,334 | 832 | 80.6 | 44.0 |

Maximum |relative error| (%) across 60 fits (3 surfaces × 20 sampling ranges), computed over successful (converged) fits only. Failure counts show convergence failures out of 60 total fits.

✓ Key Result

VPNLS eliminates the biases inherent in the parabolic approximation and avoids the fragile gradient tuning that complicates L-BFGS-B. On noise-free data, all five loss surface parameters \((E, A, B, \alpha, \beta)\) are recovered to near machine precision (worst-case errors below 10⁻⁷%), so extrapolation to higher compute budgets is effectively exact.

Conclusion

The biases documented in this article are structural, not statistical. They exist on noise-free data with perfect experimental conditions. Real experiments, which contend with measurement noise and unknown optima, can only make them worse.

Two independent sources of error compound in practice. Surface asymmetry (\(\alpha \neq \beta\)) biases intercepts, and off-center sampling biases intercepts or exponents depending on whether the offset is constant or varies with compute budget. Both act simultaneously in any real experiment.

A practical alternative exists. Variable projection recovers all five surface parameters with machine precision, uses the same intuitive linear separation that makes Approach 2 appealing, and is straightforward to implement.

For practitioners using Approach 2: be aware that intercept estimates carry a systematic bias that grows with exponent asymmetry and sampling grid width. When precision matters for extrapolation to large compute budgets, consider variable projection as a robust alternative.

Limitations

Several limitations scope the conclusions of this study. We highlight the most important ones here.

Appendix

A. Detailed Method Comparison

Detailed method comparison showing per-parameter error across surfaces and sampling ranges
Figure A1: Per-parameter recovery error for nine fitting methods across three loss surfaces and 20 sampling ranges (baseline, no bias). Each panel shows absolute relative error (%) on a log scale versus sampling range, with one curve per method. Rows correspond to loss surfaces (symmetric, Chinchilla, high imbalance); columns correspond to parameters (E, A, B, α, β). Gaps indicate convergence failures.

References

  1. "Training Compute-Optimal Large Language Models," ArXiv. https://arxiv.org/abs/2203.15556
  2. "The Llama 3 Herd of Models," ArXiv. https://arxiv.org/abs/2407.21783
  3. "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism," ArXiv. https://arxiv.org/abs/2401.02954
  4. "Exploring Scaling Laws for EHR Foundation Models," ArXiv. https://arxiv.org/abs/2505.22964
  5. "Sequence modeling and design from molecular to genome scale with Evo," bioRxiv. https://www.biorxiv.org/content/10.1101/2024.02.27.582234v2
  6. "Scaling Laws for Imitation Learning in Single-Agent Games," TMLR. https://arxiv.org/abs/2307.09423
  7. "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design," NeurIPS. https://arxiv.org/abs/2305.13035
  8. "Scaling Laws of Motion Forecasting and Planning -- Technical Report," ArXiv. https://arxiv.org/abs/2506.08228
  9. "Training compute-optimal transformer encoder models," EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1804.pdf
  10. "Scaling Laws For Diffusion Transformers," ArXiv. https://arxiv.org/abs/2410.08184
  11. "Scaling Behavior of Discrete Diffusion Language Models," ArXiv. https://arxiv.org/abs/2512.10858
  12. "Scaling Laws for Compute Optimal Biosignal Transformers," Other. https://dspacemainprd01.lib.uwaterloo.ca/server/api/core/bitstreams/b66b1078-b359-4688-8dac-45e78806eb3d/content
  13. "(Mis)fitting: A Survey of Scaling Laws," ICLR 2025. https://arxiv.org/abs/2502.18969
  14. "Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic," CVPR 2024. https://arxiv.org/abs/2404.07177
  15. "Evaluating the Robustness of Chinchilla Compute-Optimal Scaling," ArXiv. https://arxiv.org/abs/2509.23963
  16. "Scaling Laws for Neural Language Models," ArXiv. https://arxiv.org/abs/2001.08361